L07 Layers

Data Visualization (STAT 302)

Author

JULIANNA WANG

Overview

The goal of this lab is to explore more plots in ggplot2 and to keep using layers to build complex, well-annotated plots.

Datasets

We’ll be using the tech_stocks.rda dataset which is already in the /data subdirectory in our data_vis_labs project.

We have a new dataset, NU_admission_data.csv, which will need to be downloaded and added to our /data subdirectory.

We will also be using the mpg dataset which comes packaged with ggplot2 — use ?ggplot2::mpg to access its codebook.

Code
# load package(s)
library(tidyverse)
library(patchwork)

# load datasets
load('data/tech_stocks.rda')
admin_data <- read_csv(file = 'data/NU_admission_data.csv') |>
  janitor::clean_names()
data(mpg)

Exercise 1

Using mpg and the class_dat dataset created below, recreate the following graphic as precisely as possible in two different ways.

Hints:

  • Transparency is 0.6
  • Horizontal spread is 0.1
  • Larger points are 5
  • Larger points are “red”
Code
# additional dataset for plot
class_dat <- mpg |>
  group_by(class) |>
  summarise(
    n = n(),
    mean_hwy = mean(hwy),
    label = str_c("n = ", n, sep = "")
  )

Plot 1 – using mean_hwy

Solution
Code
# building the plot 
ggplot(data = mpg, aes(x = class, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1)) +
  geom_point(data = class_dat, aes(y = mean_hwy), size = 5, color = 'red', alpha = 0.6) +
  geom_text(data = class_dat, aes(x = class, y = 10, label = label)) +
  scale_y_continuous(
    name = 'Highway miles per gallon',
    expand = c(0, 0),
    limits = c(8, 45),
    breaks = seq(10, 40, 10)
  ) + 
  xlab('Vehicle Class') + 
  theme_minimal()

Plot 2 – not using mean_hwy

Solution
Code
# building the plot, without using mean_hwy
ggplot(data = mpg, aes(x = class, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1)) +
  stat_summary(fun = 'mean', color = 'red', size = 5, geom = 'point', alpha = 0.6) + 
  geom_text(data = class_dat, aes(x = class, y = 10, label = label)) + 
  scale_y_continuous(
    name = 'Highway miles per gallon',
    expand = c(0, 0),
    limits = c(8, 45),
    breaks = seq(10, 40, 10)
  ) + 
  xlab('Vehicle Class') + 
  theme_minimal()
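
As a quick check that the two approaches place the red points at the same heights, the sketch below pulls the values computed by stat_summary() out of a minimal plot with layer_data() and compares them with per-class means computed by hand. The names check_plot, computed, and by_hand are illustrative and not part of the lab.

```r
library(dplyr)
library(ggplot2)

# minimal plot with only the summary layer (no jitter, no labels)
check_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  stat_summary(fun = mean, geom = 'point')

# values that stat_summary() computed, one row per class
computed <- layer_data(check_plot, 1)

# the same means computed directly, as in class_dat
by_hand <- mpg |>
  group_by(class) |>
  summarise(mean_hwy = mean(hwy))

# sort both so the comparison does not depend on row order
all.equal(sort(computed$y), sort(by_hand$mean_hwy))
```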

Exercise 2

Using the perc_increase dataset derived from the tech_stocks dataset, recreate the following graphic as precisely as possible.

Hints:

  • Hex color code #56B4E9
  • Justification of 1.1
  • Size is 5
Code
# percentage increase data
perc_increase <- tech_stocks |>
  arrange(desc(date)) |>
  distinct(company, .keep_all = TRUE) |>
  mutate(
    perc = 100 * (price - index_price) / index_price,
    label = str_c(round(perc), "%", sep = ""),
    company = fct_reorder(factor(company), perc)
  )
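
The arrange(desc(date)) followed by distinct(company, .keep_all = TRUE) step keeps only the most recent row for each company, because distinct() keeps the first row it sees for each value. A minimal sketch on made-up data (toy_stocks and its values are hypothetical, not taken from tech_stocks):

```r
library(dplyr)

# made-up prices for two hypothetical companies (not from tech_stocks)
toy_stocks <- tibble(
  company = c('A', 'A', 'B', 'B'),
  date = as.Date(c('2020-01-01', '2020-06-01', '2020-01-01', '2020-06-01')),
  price = c(10, 15, 20, 18)
)

# sort newest first, then keep the first row seen for each company
latest <- toy_stocks |>
  arrange(desc(date)) |>
  distinct(company, .keep_all = TRUE)

latest
```

Each company ends up with one row, and that row carries its latest date.
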
Solution
Code
ggplot(data = perc_increase, aes(x = company, y = perc)) +
  geom_col(fill = '#56B4E9') +
  geom_text(aes(label = label), size = 5, hjust = 1.1, color = 'white') + 
  coord_flip() +
  xlab(NULL) + 
  ylab(NULL) + 
  theme_minimal() 

Exercise 3

Warning

Some thoughtful data wrangling will be needed, and it will be demonstrated in class. Do not expect a video.

Examine the data and the plot provided in undergraduate-admissions-statistics.pdf, which was collected from the Northwestern Data Book webpage. As you can see, they have overlaid two plot types on one another by using dual y-axes.

There is one major error they make with the bars in their graphic. Explain what it is.

Solution

The bars are stacked, so the top of each bar does not match the total number of applicants. For instance, 2002 had around 14,000 applicants, but the bar reaches about 20,000 because the number of applicants is added to the number of admitted students and the number of matriculants. Admitted students are a subset of applicants, and matriculants are a subset of admitted students, so stacking counts the same people more than once.
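
The effect of stacking can be seen on a tiny example. The sketch below uses made-up counts for one hypothetical year (not the Northwestern figures) and compares the tallest bar edge under position = 'stack' and position = 'identity'; the names toy, stacked, and overlaid are illustrative.

```r
library(ggplot2)

# made-up counts for a single hypothetical year (not the Northwestern figures)
toy <- data.frame(
  year = 2000,
  category = c('applications', 'admitted_students', 'matriculants'),
  value = c(100, 30, 10)
)

stacked <- ggplot(toy, aes(year, value, fill = category)) +
  geom_col(position = 'stack')
overlaid <- ggplot(toy, aes(year, value, fill = category)) +
  geom_col(position = 'identity')

# top edge of the tallest bar under each position adjustment
max(layer_data(stacked, 1)$ymax)   # sum of all three categories
max(layer_data(overlaid, 1)$ymax)  # the applications count alone
```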

Using NU_admission_data.csv (see footnote 1), create two separate plots that display the same information instead of trying to put it all in one single plot, then stack them using patchwork or cowplot.

Which approach do you find communicates the information better, the single dual y-axes plot or the two separate plot approach? Why?

Solution

The two separate plot approach communicates the information better than the single dual y-axes plot. A dual y-axes plot is helpful when the two axes show the same quantity in different units, such as metric versus imperial measurements. Here the two axes show different quantities (counts and rates), which made the plot visually busy and the information hard to read. The two separate plot approach communicates the same information in a way that makes sense to the viewer by splitting the admissions data into two distinct plots. It also fixes some of the issues found in the dual y-axes plot, such as the total number of applicants not lining up with what is shown on the bar graph.
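
For contrast, here is a minimal sketch of a case where a second y-axis is reasonable: the same quantity shown in two units. It uses the mpg data and sec_axis(); the conversion factor (1 mile per US gallon is roughly 0.425 kilometres per litre) and the names mpg_to_kml and dual_axis_plot are assumptions made for the illustration.

```r
library(ggplot2)

# approximate conversion: 1 mile per US gallon is about 0.425 kilometres per litre
mpg_to_kml <- 0.425

dual_axis_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_continuous(
    name = 'Highway miles per gallon',
    # the secondary axis is a one-to-one transformation of the primary axis
    sec.axis = sec_axis(~ . * mpg_to_kml, name = 'Highway kilometres per litre')
  )

dual_axis_plot
```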

Hints:

  • Form 4 datasets (helps you get organized, but not entirely necessary):
    • 1 that has bar chart data,
    • 1 that has bar chart label data,
    • 1 that has line chart data, and
    • 1 that has line chart labels
  • Consider using ggsave() to save the image with a fixed size so it is easier to pick font sizes.
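
A minimal sketch of the ggsave() hint, assuming a stand-in plot built from mpg and a temporary output file (demo_plot and out_file are hypothetical names). Fixing width, height, and units means text sizes look the same each time the file is written.

```r
library(ggplot2)

# any plot will do; this one uses the mpg data from the lab
demo_plot <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_point()

# hypothetical output location; a temporary file keeps the project folder clean
out_file <- file.path(tempdir(), 'demo_plot.pdf')

# fixed physical size, so font sizes can be tuned against a stable canvas
ggsave(filename = out_file, plot = demo_plot, width = 8, height = 5, units = 'in')

file.exists(out_file)
```

ggsave() picks the graphics device from the file extension, so a different extension such as .png can be used in the same way.
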
Solution
Code
# data wrangling
bar_data <- admin_data |> 
  select(-contains('_rate')) |> # -contains() drops every column whose name contains '_rate'
  pivot_longer(
    cols = -year,
    names_to = 'category',
    values_to = 'value'
  ) |>
  mutate(
    bar_labels = prettyNum(value, big.mark = ',')
  )

# building the plot
bar_plot <- bar_data |>
  filter(year >= 2002) |>
  ggplot(aes(year, value, fill = category)) +
  geom_col(mapping = aes(fill = category),
    width = 0.5,
    position = 'identity'
  )+ 
  geom_text(aes(label = bar_labels),
            size = 1.5,
            color = 'white',
            vjust = 1,
            nudge_y = -200) +
  scale_x_continuous(
    name = 'Entering Year', 
    breaks = 2002:2022,
    expand = c(0, 0.25) # first value is the multiplicative expansion, second is the additive expansion
  ) + 
  scale_y_continuous(
    name = 'Applications',
    expand = c(0, 0),
    limits = c(0, 70000),
    breaks = seq(0, 70000, 10000), 
    labels = scales::label_comma()
  ) + 
  scale_fill_manual(
    name = NULL,
    limits = c('applications', 'admitted_students', 'matriculants'),
    labels = c('Applications', 'Admitted Students', 'Matriculants'),
    values = c('#B6ACD1', '#836EAA', '#4E28A4')
  ) + 
  theme_classic() +
  theme(
    legend.justification = c(0.5, 1), 
    legend.position = 'inside', 
    legend.position.inside = c(0.5,1),
    legend.direction = 'horizontal',
    plot.title = element_text(hjust = 0.5)
  ) + 
  ggtitle('Northwestern University \nUndergraduate Admissions 2002-2022')
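
A short aside on the bar labels: in prettyNum() and format(), the thousands separator comes from big.mark, while big.interval only sets how many digits sit between separators. A minimal sketch with an arbitrary example number (comma_label is an illustrative name):

```r
# big.mark supplies the separator character
comma_label <- prettyNum(14000, big.mark = ',')
comma_label

# format() accepts the same argument
format(1234567, big.mark = ',')

# scales offers an equivalent labeller, the same one used for the axis labels
scales::label_comma()(14000)
```
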
Code
# data wrangling
rate_data <- admin_data |> 
  select(year, contains('_rate')) |> 
  pivot_longer(
    cols = -year,
    names_to = 'category',
    values_to = 'value'
  ) |>
  mutate(
    y_label = case_when(
      category == 'admission_rate' ~ value - 2,
      category == 'yield_rate' ~ value + 2
    ),
    rate_labels = str_c(value, '%') # str_c concatenates values into a string
  )

# legend labels (shape, line, fill, color)
legend_labels <- c('Admission Rate', 'Yield Rate')

# building the plot 
line_plot <- rate_data |>
  filter(year >= 2002) |>
  ggplot(aes(x = year, y = value)) + 
  geom_line(aes(color = category)) + 
  geom_point(aes(fill = category, shape = category), color = 'white', size = 2) + 
  scale_y_continuous(
    name = 'Rate', 
    breaks = seq(0, 65, 5), 
    limits = c(0, 68),
    labels = scales::label_percent(scale = 1),
    expand = c(0, 0)
  ) + 
  geom_text(
    aes(y = y_label, label = rate_labels),
    size = 2,
    color = 'grey40'
  ) + 
  scale_x_continuous(
    name = 'Entering Year', 
    breaks = 2002:2022,
    expand = c(0, 0.35)
  ) + 
  scale_shape_manual(
    name = NULL,
    values = c(21, 24),
    labels = legend_labels
  ) + 
  scale_color_discrete(name = NULL, labels = legend_labels) + 
  scale_fill_discrete(name = NULL, labels = legend_labels) + 
  theme_classic() +
  theme(legend.justification = c(0.5, 1), 
        legend.position = 'inside',
        legend.position.inside = c(0.5, 1),
        legend.direction = 'horizontal',
        plot.title = element_text(hjust = 0.5)
  ) + 
  ggtitle('Northwestern University \nUndergraduate Admissions 2002-2022')
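
The two-number form of expand used in the scales can also be written with ggplot2's expansion() helper, which names the multiplicative and additive parts explicitly. A minimal sketch (the object names are illustrative):

```r
library(ggplot2)

# no multiplicative padding, 0.35 data units of additive padding on both ends
symmetric <- expansion(mult = 0, add = 0.35)
symmetric

# length-two vectors pad the lower and upper ends differently
upper_only <- expansion(mult = c(0, 0.05))
upper_only

# used in a scale in place of expand = c(0, 0.35)
x_scale <- scale_x_continuous(expand = expansion(add = 0.35))
```
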
Code
bar_plot / line_plot
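
Beyond plain stacking, patchwork can also set relative panel heights and a shared title. The sketch below uses two stand-in plots built from mpg rather than the admissions plots; top_plot, bottom_plot, and stacked_plots are hypothetical names, and the height ratio is an arbitrary choice.

```r
library(ggplot2)
library(patchwork)

# two small stand-in plots built from mpg (not the admissions plots)
top_plot <- ggplot(mpg, aes(x = class)) +
  geom_bar()
bottom_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# `/` stacks vertically; plot_layout() sets relative heights;
# plot_annotation() adds one title for the whole figure
stacked_plots <- top_plot / bottom_plot +
  plot_layout(heights = c(2, 1)) +
  plot_annotation(title = 'Shared title for both panels')

stacked_plots
```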

Footnotes

  1. Data is taken from the pdf. The file includes a few extra years.